ABSTRACT

With the current trend of multiprocessor machines towards 
more and more hierarchical architectures, exploiting the full 
computational power requires careful distribution of exe- 
cution threads and data so as to limit expensive remote 
memory accesses. Existing multi-threaded libraries provide
only limited facilities to let applications express distribution
hints, so that programmers end up explicitly
distributing tasks according to the underlying architecture,
which is difficult and not portable. In this article, we
present: (1) a model for dynamically expressing the struc- 
ture of the computation; (2) a scheduler interpreting this 
model so as to make judicious hierarchical distribution deci- 
sions; (3) an implementation within the Marcel user-level 
thread library. We evaluated our proposal on a scientific
application running on a ccNUMA Bull NovaScale with 
16 Intel Itanium II processors; results show a 30% gain 
compared to a classical scheduler, and are similar to what a 
handmade scheduler achieves in a non-portable way. 

1. INTRODUCTION 

"Disable HyperThreading!" That is unfortunately the most 
common pragmatic answer to performance losses noticed 
on HyperThreading-capable processors such as the Intel 
Xeon. This is of particular concern since hierarchy depth 
has increased over the past few years, making current computer
architectures more and more complex (Sun WildFire,
SGI Origin, Bull NovaScale for instance).

Those machines look like Russian dolls: nested technologies 
allow them to execute several threads at the same time on 
the same core of one processor (SMT: Simultaneous Multi- 
Threading), to share cache memory between several cores 
(multicore chips), and finally to interconnect several multi- 
processor boards (SMP) thanks to crossbar networks. The 
resulting machine is a NUMA (Non-Uniform Memory Ar- 
chitecture) computer, on which the memory access delay 
depends on the relative positions of processors and memory 
banks (this is called the "NUMA factor"). 

The recent integration of SMT and multicore technologies
makes the structure of NUMA machines even more complex,
yet operating systems have still not exploited previous
NUMA machines efficiently. Hennessy and Patterson underlined
that fact about systems proposed for SGI Origin
and Sun WildFire: "There is a long history of software
lagging behind on massively parallel processors, possibly because
the software problems are much harder." The introduction
of new hardware technologies emphasizes the need
for software development. Our goal is to provide a portable 
solution to enhance the efficiency of high-performance multi- 
threaded applications on modern computers. 

Obtaining optimal performance on such machines is a signif- 
icant challenge. Indeed, without any information on tasks' 
affinity, it is difficult to make good decisions about how to 
group tasks working on a common data set on NUMA nodes. 
Detecting such affinity is hard, unless the application itself 
somehow expresses it. 

To relieve programmers from the burden of redesigning the 
whole task scheduling mechanism for each target machine, 
we propose to establish a communication between the execu- 
tion environment and the application so as to automatically 
get an optimized schedule. The application describes the 
organization of its tasks by grouping those that work on 
the same data (memory affinity) for instance. The system 
scheduler can then exploit this information by adapting the 
task distribution to the hierarchical levels of the machine. 

Of course, a universal scheduler that would get good results 
by using only such a small amount of information remains 
to be written. In the meantime, we provide facilities for 
applications to query the system about the topology of the 
underlying architecture and "drive" the scheduler. As a re- 
sult, the programmer can easily try and evaluate different 
gathering strategies. More than a mere scheduling model, 
we propose a scheduling experimentation platform. 

In this article, we first present the main existing approaches 
that exploit hierarchical machines, then we propose two new 
models describing application tasks and hierarchical levels of 
the machine, as well as a scheduler that takes advantage of 
them. Some implementation details and evaluation results 
are given before concluding. 



2. EXPLOITING HIERARCHICAL 
MACHINES 

Nowadays, multiprocessor machines like NUMAs with multi- 
threaded multicores are increasingly difficult to exploit. Sev- 
eral approaches have been considered. 



2.1 Predetermined distribution and 
scheduling 

For very regular problems, it is possible to determine a task 
schedule and a data distribution that are suited to the target 
machine and its hierarchical levels. The application just 
needs to get the system to apply that schedule and that 
distribution, and excellent (if not optimal) performance can 
be obtained. The PaStiX large sparse linear systems
solver is a good example of this approach. It first launches
a simulation of the computation based on models of BLAS 
operators and communications on the target architecture. 
Then it can compute a static schedule of block-computations 
and communications. 

So as to enforce these scheduling strategies, many systems 
(AIX, Linux, Solaris, Windows, ...) allow process threads
to be bound to processor sets, and memory allocations to 
be bound to memory nodes. Provided that the machine 
is dedicated to the application, the thread scheduling can 
be fully controlled by binding exactly one thread to each 
processor. To perform task switching, mere explicit context 
switches may be used: threads are only used as execution 
flow holders. 



2.2 Opportunist distribution and scheduling 

Greedy algorithms (called Self-Scheduling (SS) [27]) are dynamic,
flexible and portable solutions for loop parallelization.
Whatever the target machine, a Self-Scheduling algorithm
takes care of both thread scheduling and data distribution.
Operating system schedulers are based on these
algorithms.

They basically use a single list of ready tasks from which 
the scheduler just picks up the next thread to be sched- 
uled. Hence the workload is automatically distributed be- 
tween processors. For each task, the last processor on which 
it was scheduled is recorded, so as to try to reschedule it 
on the same processor as much as possible to avoid cache 
misses. These techniques are used in the Linux 2.4 and
Windows 2000 [25] operating systems. However, a single
thread list for the whole machine is a bottleneck, particu- 
larly when the machine has many processors. 

To avoid such contention, the Guided Self-Scheduling (GSS) [22]
and Trapezoid Self-Scheduling (TSS) [30] algorithms make
each processor take a whole chunk of the total work whenever it
is idle, at the risk of creating imbalances. The AFfinity Scheduling
(AFS) [15] and Locality-based Dynamic Scheduling (LDS)
[14] algorithms use a per-processor task list. Whenever idle,
a processor will steal work from the least loaded list, for
instance. These latter algorithms are used by current operating
systems (Linux 2.6, FreeBSD 5.0 [53], Cellular
IRIX [33]). They also add a few rebalancing policies: new processes
are assigned to the least loaded processor, for instance.
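
The placement policy mentioned above (a new task goes to the least loaded processor) can be sketched as follows. This is only an illustration: the `runqueue` structure and both function names are hypothetical and do not come from any of the cited systems.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-processor run queue: only its length matters here. */
struct runqueue {
    size_t nr_ready; /* number of ready tasks on this processor's list */
};

/* Return the index of the least loaded queue (lowest index on ties). */
size_t least_loaded(const struct runqueue *rq, size_t ncpus)
{
    assert(ncpus > 0);
    size_t best = 0;
    for (size_t cpu = 1; cpu < ncpus; cpu++)
        if (rq[cpu].nr_ready < rq[best].nr_ready)
            best = cpu;
    return best;
}

/* Charge a newly created task to the least loaded processor. */
size_t enqueue_new_task(struct runqueue *rq, size_t ncpus)
{
    size_t cpu = least_loaded(rq, ncpus);
    rq[cpu].nr_ready++;
    return cpu;
}
```

A real scheduler would weigh tasks rather than count them, and would protect each queue with a lock; both are left out to keep the sketch short.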

However, contention appears quickly as the number of
processors increases, particularly on NUMA machines. Wang
et al. propose a Clustered AFfinity Scheduling (CAFS)
[31] algorithm which groups p processors in groups of $\sqrt{p}$.
Whenever idle, rather than looking around the whole machine,
processors steal work from the least loaded processor
of their group, hence getting better localization of list
accesses. Moreover, by aligning groups to NUMA nodes,
data distribution is also localized. Finally, the Hierarchical
AFfinity Scheduling (HAFS) (Wang et al. [32]) algorithm
lets any idle group steal work from the most loaded group.
This latter approach is being considered for latest NUMA- 
aware developments of operating systems such as Linux 2.6 
and FreeBSD. 
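
As a rough illustration of the grouping used by CAFS and of the group-level stealing of HAFS, the sketch below maps a processor to its group and picks a victim group. The function names are hypothetical, the processor count is assumed to be a perfect square for the grouping to be exact, and the load metric is a plain task count.

```c
#include <assert.h>
#include <stddef.h>

/* Largest g such that g * g <= p (integer square root). */
size_t int_sqrt(size_t p)
{
    size_t g = 0;
    while ((g + 1) * (g + 1) <= p)
        g++;
    return g;
}

/* Group index of a processor when p processors are split into
 * groups of sqrt(p) processors each. */
size_t group_of(size_t cpu, size_t p)
{
    assert(p > 0 && cpu < p);
    return cpu / int_sqrt(p);
}

/* HAFS-like victim selection: an idle group picks the most loaded
 * other group. Returns `self` if no other group has any work. */
size_t most_loaded_group(const size_t *load, size_t ngroups, size_t self)
{
    size_t victim = self;
    size_t best = 0;
    for (size_t g = 0; g < ngroups; g++)
        if (g != self && load[g] > best) {
            best = load[g];
            victim = g;
        }
    return victim;
}
```

With 16 processors this yields 4 groups of 4, which happens to match a machine with 4 NUMA nodes of 4 processors.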



2.3 Negotiated distribution and scheduling 

There are intermediate solutions between predetermined and 
opportunist scheduling. Some language extensions such as 
OpenMP, HPF (High Performance Fortran) or
UPC (Unified Parallel C) let one achieve parallel programming
by simply annotating the source code. For instance,
a for loop may be annotated to be automatically
parallelized. An HPF matrix may be annotated to be automatically
split into fairly independent domains that will be
processed in parallel.
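
For instance, an annotated loop might look like the following minimal OpenMP sketch in C. The function name and its arguments are made up for the illustration; only the `#pragma` line is specific to OpenMP.

```c
#include <assert.h>
#include <stddef.h>

/* Scale a vector in place and return the sum of the results.
 * The pragma is the only parallel-specific line: an OpenMP compiler
 * splits the iterations among threads, other compilers ignore it. */
double scale_and_sum(double *v, size_t n, double factor)
{
    assert(v != NULL || n == 0);
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++) {
        v[i] *= factor;
        sum += v[i];
    }
    return sum;
}
```

The programmer states that iterations are independent and that `sum` is a reduction; how iterations are mapped to processors is left to the compiler and its runtime.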

The distribution and scheduling decisions then belong to 
the compiler. To do this, it adds code to query the exe- 
cution environment (the number of processors for instance) 
and compiles the program in a way generic enough to adapt 
to the different parallel architectures. In particular, it will 
have to handle threads for parallelized loops or distributed 
computing, and even handle data exchange between processors
(in the case of distributed matrices of HPF). To date,
expressiveness is mostly limited to "Fork-Join" parallelism,
which means, for instance, that the programmer cannot express
imbalanced parallelism.

Programmers may also directly write applications that are 
able to adapt themselves to the target machine at runtime. 
Modern operating systems provide full information about 
the architecture of the machine (user-level libraries are available:
lgroup for Solaris or numa for Linux). The
application can then not only get the number of processors, 
but also get the NUMA nodes hierarchy, their respective 
number of processors and their memory sizes. Those sys- 
tems also let the application choose the memory allocation 
policy (specific memory node, first touch or round robin) and 
bind threads to CPU sets. Thus, the application controls 
threads and memory distribution, but it is then in charge of 
balancing threads between processors. 

2.4 Discussion 

We chose to classify existing approaches into three cate- 
gories. The predetermined category gives excellent perfor- 
mance. But it is portable only if the problem is regular, i.e., 
its solving time depends on the data structure and not on 
the data itself. The opportunist approach scales well, but 
does not take task affinities into account, and thus, on av- 
erage, does not get excellent performance. The negotiated 
approach lets the application adapt itself to the underlying 
machine, but requires rewriting of some parts of the sched- 
uler in order to be flexible. 

Our proposal is a mix between negotiated and opportunist 
approaches. We will give the programmers means to dy- 
namically describe how their applications behave, and use 
this information to guide a generic opportunist scheduler. 



3. PROPOSAL: AN APPLICATION-GUIDED
SCHEDULER 

Our proposal is based on a collaboration between the appli- 
cation and its execution environment. 

3.1 Bubbles modeling the application 
structure 

The application is asked to model the general layout of its 
threads in terms of nested sets called bubbles¹.

Figure 1 shows such a model: the application groups threads
into pairs, along with a communication thread (priorities 
will be discussed later). The concept of bubbles can be un- 
derstood as a coset with respect to a specific affinity rela- 
tion, and bubble nesting expresses refinement of a relation 
by another one. Indeed, several affinity relations can be 
considered, for instance: 

Data sharing It is a good idea to group threads that work 
on the same data so as to benefit from cache effects, 
or at least to avoid spreading the data throughout 
the NUMA nodes thereby incurring the NUMA fac- 
tor penalty. 

Collective operations It can be beneficial to optimize the 
scheduling of threads which are to perform collective 
operations such as a synchronization barrier, which en- 
sures that all involved threads have reached the barrier 
before they can continue executing. 

SMT Many attempts were made to address thread scheduling
on Simultaneous Multi-Threading (SMT) processors,
mostly by detecting affinities between threads at
runtime. Indeed, in some cases, pairs of threads
may be able to efficiently exploit the SMT technology:
they can run in parallel on the logical processors of the
same physical processor without interfering. If the programmer
knows that some pairs of threads can work
in such symbiosis, they can express this relation.

Other relations may be used to express parallelism, sequentiality,
preemption, etc. Yet, blindly expressing these
relations may also be detrimental: Bulpin and Pratt show
performance loss [3] on SMT processors due to frequent
cache misses for instance; Antonopoulos et al. also show
performance loss [5] when the SMP bus bandwidth
limit is not taken into account. But the programmer may try and
test different refinements of the relations and thus experimentally
reveal how the threads of an application should be
related.
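
One possible way to represent such nested sets is sketched below. It is not Marcel's actual data structure: the `task` type, its fields and the helper functions are hypothetical and only illustrate that a bubble holds threads and other bubbles alike.

```c
#include <assert.h>
#include <stddef.h>

enum task_kind { TASK_THREAD, TASK_BUBBLE };

/* A task is either a thread or a bubble holding other tasks. */
struct task {
    enum task_kind kind;
    struct task *holder;      /* enclosing bubble, NULL at top level */
    struct task *first_held;  /* bubbles only: first held task */
    struct task *next_held;   /* next sibling within the same bubble */
};

void task_init(struct task *t, enum task_kind kind)
{
    t->kind = kind;
    t->holder = NULL;
    t->first_held = NULL;
    t->next_held = NULL;
}

/* Put `t` (thread or sub-bubble) inside `bubble`. */
void bubble_insert(struct task *bubble, struct task *t)
{
    assert(bubble->kind == TASK_BUBBLE && t->holder == NULL);
    t->holder = bubble;
    t->next_held = bubble->first_held;
    bubble->first_held = t;
}

/* Count the threads held by a task, through any depth of nesting. */
size_t count_threads(const struct task *t)
{
    if (t->kind == TASK_THREAD)
        return 1;
    size_t n = 0;
    for (const struct task *h = t->first_held; h != NULL; h = h->next_held)
        n += count_threads(h);
    return n;
}

/* Nesting depth of a task: 0 at top level. */
size_t nesting_depth(const struct task *t)
{
    size_t d = 0;
    for (; t->holder != NULL; t = t->holder)
        d++;
    return d;
}
```

An outer bubble holding one communication thread and two bubbles of two threads each would then report five threads in total, with the computing threads at nesting depth 2.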

In order to cope with the emerging multiprocessor networks
of the 1980s, Ousterhout [22] proposed to group data
and threads by affinity into gangs. These gangs hold a fixed
number of threads which are to be launched at the same
time on the same machine of the network: this is called Gang
Scheduling. However, processors may be left idle because a
single machine can only run one gang at a time, even if it is
"small". Feitelson et al. [5] propose a hierarchical control
of the processors so as to execute several gangs on the same
machine. Our approach is actually a generalization of this
approach.

¹ In a way relatively similar to some communication libraries
such as MPI, which ask the application to specify communicators:
groups of machines which will communicate.

3.2 Task lists modeling the computing power 
structure 

According to Dandamudi and Cheng, a hierarchy of
task lists generally brings better performance than simple
per-processor lists. This is why two-level list schedulers have
been developed [9, 20]. Moreover, it makes task binding to
processor sets easier. In a manner similar to Nikolopoulos
et al.'s Nano-Threads list hierarchy [19], we have taken up
and generalized this point of view.

Indeed, we model hierarchical machines by a hierarchy of 
task lists. Each component of each level of the hierarchy of 
the machine has one and only one task list. Figure 2 shows
a hierarchical machine and its model. The whole machine, 
each NUMA node, each core, each physical SMT processor 
and each logical SMT processor has a task list. 

For a given task, the list on which it is inserted expresses 
the scheduling area: if the task is on a list associated with a 
physical chip, it will be allowed to be run by any processor 
on this chip; if it is placed on the global list, it will be allowed 
to be run by any processor of the machine. 
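
The notion of scheduling area can be illustrated with a small tree of task lists. The sketch below is an assumption-laden simplification: the `tasklist` structure is hypothetical, each list covers a contiguous range of logical processors, and the example machine has only three levels below the machine-wide list (2 NUMA nodes, 2 cores per node, 2 logical processors per core).

```c
#include <assert.h>
#include <stddef.h>

/* One task list per component of the machine. Each list covers a
 * contiguous range of logical processors [first_cpu, last_cpu]. */
struct tasklist {
    struct tasklist *parent; /* list of the enclosing component, NULL for the machine */
    unsigned first_cpu;
    unsigned last_cpu;
};

/* A task queued on `l` may be run by `cpu` iff `l` covers `cpu`. */
int covers(const struct tasklist *l, unsigned cpu)
{
    return l->first_cpu <= cpu && cpu <= l->last_cpu;
}

/* Number of levels between a list and the machine-wide list. */
unsigned depth(const struct tasklist *l)
{
    unsigned d = 0;
    for (; l->parent != NULL; l = l->parent)
        d++;
    return d;
}

/* Build a 2-node x 2-core x 2-way SMT machine (8 logical processors).
 * Layout of `l`: [0] machine, [1..2] nodes, [3..6] cores, [7..14] logical. */
void build_machine(struct tasklist l[15])
{
    l[0] = (struct tasklist){ NULL, 0, 7 };
    for (unsigned n = 0; n < 2; n++)
        l[1 + n] = (struct tasklist){ &l[0], 4 * n, 4 * n + 3 };
    for (unsigned c = 0; c < 4; c++)
        l[3 + c] = (struct tasklist){ &l[1 + c / 2], 2 * c, 2 * c + 1 };
    for (unsigned p = 0; p < 8; p++)
        l[7 + p] = (struct tasklist){ &l[3 + p / 2], p, p };
    assert(depth(&l[14]) == 3);
}
```

A task placed on a core's list can then only be picked by the two logical processors of that core, while a task on list 0 can be picked by any of the eight.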

3.3 Putting both models together: a bubble 
scheduler 

Once the application has created bubbles, threads and bub- 
bles are just "tasks" that the execution environment dis- 
tributes on the machine. 

3.3.1 Bubble evolution 

As Figure 3 shows, the goal of a bubble is to hold tasks
and bring them to the level where their scheduling will be 
most efficient. For this, the bubble goes down through lists 
to the wanted hierarchical level. It then "bursts", i.e. held 
threads and bubbles are released and can be executed (or go 
deeper). The list of held tasks is recorded, for a potential 
later regeneration (see Section 3.3.3). The main issue is how
to specify the right bursting level of a bubble. 

In the long run, once we get good heuristics for a bubble 
scheduler, specifying such a parameter will no longer be 
necessary. For now, the goal is to provide an experimental 
platform for developing schedulers, and hence allow this pa- 
rameter to be tuned by the scheduler developers. They can 
favor task affinity with the risk of making the load balance 
difficult (by setting deep bursting levels) or on the contrary 
favor processor use (by setting high bursting levels). 
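
The descent and bursting of a bubble, driven by a tunable bursting level, might be modelled as below. This is a toy model, not the Marcel implementation: the `bubble` structure, the integer task identifiers and the level numbering (0 for the machine-wide list, larger values for more local lists) are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HELD 8

struct bubble {
    unsigned level;        /* level of the list currently holding the bubble (0 = machine) */
    unsigned burst_level;  /* level at which the bubble must burst */
    int burst;             /* non-zero once held tasks have been released */
    int held[MAX_HELD];    /* identifiers of held tasks, kept for later regeneration */
    size_t nr_held;
};

/* One scheduler step on a bubble found on a list: go one level down,
 * or burst once the wanted level is reached. `released` receives the
 * task identifiers that become schedulable; their number is returned. */
size_t bubble_step(struct bubble *b, int released[MAX_HELD])
{
    assert(b->nr_held <= MAX_HELD);
    if (b->burst)
        return 0;
    if (b->level < b->burst_level) {
        b->level++;
        return 0;
    }
    for (size_t i = 0; i < b->nr_held; i++)
        released[i] = b->held[i];  /* `held` itself is left untouched */
    b->burst = 1;
    return b->nr_held;
}

/* Regeneration: released tasks are gathered back, the bubble closes again. */
void bubble_regenerate(struct bubble *b)
{
    b->burst = 0;
}
```

Setting `burst_level` to a large value keeps the held threads close together (deep bursting), while a value of 0 releases them on the machine-wide list where any processor may take them.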

3.3.2 Priorities 

We choose to let the application attach integer priorities to 
tasks. When a processor looks for a task to be scheduled, it 
searches through the lists that "cover" this processor, from 
the most local one (i.e. on low levels) to the most global
one, looking for a task with highest priority. It will then 
schedule that task, even if less prioritized tasks remain on 
more local lists. 
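
The search described above can be sketched as a walk from the most local list to the global one. The `tasklist` structure is hypothetical and only records the best priority queued on each list; larger integers are assumed to mean higher priority, and preferring the more local list on equal priorities is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define NO_TASK (-1)

/* Hypothetical task list: only the best priority it holds is modelled. */
struct tasklist {
    struct tasklist *parent;  /* more global list, NULL for the machine */
    int max_prio;             /* highest priority queued here, NO_TASK if empty */
};

/* Walk from the most local list covering a processor up to the global
 * list and return the list holding the highest priority task.
 * On equal priorities the more local list wins. */
struct tasklist *pick_list(struct tasklist *local)
{
    assert(local != NULL);
    struct tasklist *best = NULL;
    for (struct tasklist *l = local; l != NULL; l = l->parent)
        if (l->max_prio != NO_TASK &&
            (best == NULL || l->max_prio > best->max_prio))
            best = l;
    return best;
}
```

A priority 5 task on the machine-wide list is therefore chosen over a priority 2 task on the processor's own list.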





Figure 1: Bubble example, with priorities: thread pairs that have a higher priority than the bubbles holding them, 
and a highly prioritized thread. 



(a) A NUMA of HyperThreaded multicores.
(b) Model with task lists.
Figure 2: A high-depth hierarchical machine and its model.



Figure 1 shows an example using priorities. In this example,
bubbles holding computing threads are less prioritized than
the threads. Consequently, a bubble will burst only if every
thread of the previously burst bubbles has terminated, or if
there are not enough of them to occupy all the processors.
This results in a form of Gang Scheduling which automatically
occupies all the processors.

3.3.3 Bubble regeneration 

Bubbles are automatically distributed by the scheduler over 
the different levels of task lists of the machine, hence dis- 
tributing threads on the whole machine while taking affin- 
ity into account. However, it is possible that a whole thread 
group has far less work than others and terminates before 
them, leaving idle the whole part of the machine that was 
running it. 

To correct such imbalance, some bubbles may be regener- 
ated and moved up. Idle processors would then move some 
of them down on their side and have them re-burst there, 
getting a new distribution suited to the new workload while 
keeping affinity intact. 



To prevent such imbalances, bubbles may periodically be 
regenerated²: each bubble has its own time slice after which
its threads are preempted and the bubble regenerated. 

In the case of Figure 1, the preemption mechanism is extended
to Gang Scheduling: whenever a bubble is regenerated
(because its time slice expired), it is put back at the
end of the task list while another bubble is burst to occupy 
the resulting idle processors. 
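
The rotation of bubbles described above amounts to a round-robin over the bubbles of a list. The sketch below models it with a small circular queue of bubble identifiers; the queue type, its fixed capacity and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BUBBLES 3

/* Circular queue of bubble identifiers waiting on one task list. */
struct bubble_queue {
    int ids[NR_BUBBLES];
    size_t head;   /* index of the next bubble to burst */
    size_t count;  /* number of queued bubbles */
};

int queue_pop(struct bubble_queue *q)
{
    assert(q->count > 0);
    int id = q->ids[q->head];
    q->head = (q->head + 1) % NR_BUBBLES;
    q->count--;
    return id;
}

void queue_push(struct bubble_queue *q, int id)
{
    assert(q->count < NR_BUBBLES);
    q->ids[(q->head + q->count) % NR_BUBBLES] = id;
    q->count++;
}

/* Time slice of `running` expired: regenerate it, put it back at the
 * end of the list and burst the next bubble instead. */
int on_timeslice_expired(struct bubble_queue *q, int running)
{
    queue_push(q, running);
    return queue_pop(q);
}
```

With three bubbles, successive expirations run them in the order 1, 2, 3, 1, and so on, each gang getting the processors in turn.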

3.4 Discussion 

Bubbles give programmers the opportunity to express the 
structure of their application and to guide the scheduling of 
their threads in a simple, portable and structured way. Since 
the roles of processors and other hierarchical levels are not 
predetermined, the scheduler still has some degrees of free- 
dom and can hence use an opportunist strategy to distribute 
tasks over the whole machine. By taking into account any 
irregularity in the application, this scheduler significantly 
enhances the underlying machine exploitation. Such preventive
rebalancing techniques may still have side effects and
lead to pathological situations (ping-ponging between tasks,
useless bubble migration just before termination, etc.).

² In a way similar to Unix system thread preemption.

Figure 3: Bubble evolution. (a) The outermost bubble starts on the general list. (b) It bursts, releasing a thread
(which can immediately be scheduled on any processor) and two sub-bubbles which can go down through the hierarchy.
(c) Going down achieved. (d) Both sub-bubbles burst, releasing two threads each. (e) Threads are distributed
appropriately and can start in parallel.

marcel_t thread1, thread2;
marcel_bubble_t bubble;

/* Create an empty bubble. */
marcel_bubble_init(&bubble);

/* Create both threads without starting them. */
marcel_create_dontsched(&thread1, NULL, fun1, para1);
marcel_create_dontsched(&thread2, NULL, fun2, para2);

/* Put both threads in the bubble, then hand the bubble to the scheduler. */
marcel_bubble_inserttask(&bubble, thread1);
marcel_bubble_inserttask(&bubble, thread2);
marcel_wake_up_bubble(&bubble);

Figure 4: Bubble creation example: threads are created
without being started, then they are inserted in the same
bubble.

4. IMPLEMENTATION DETAILS 

Marcel [18, 17] is a two-level thread library: in a way similar
to manual scheduling (see Section 2.1), it binds one
kernel-level thread on each processor and then performs fast
user-level context switches between user-level threads, hence
getting complete control over thread scheduling³ in userland
without any further help from the kernel. Our proposal was
implemented within Marcel's user threads scheduler.

Figure 4 shows an example of using the interface to build
and launch a bubble containing two Marcel threads. 

The Marcel scheduler already had per-processor thread
lists, so that integrating bubbles within the library did not
need a thorough rewriting of the data structures. The scheduler
code was modified to implement list hierarchy, bubble
evolution and to take priorities (described in Section 3.3.2)
into account.

³ We suppose that no other application is running, and neglect
system daemon wake-ups.

So as to avoid contention, there is no global scheduling: pro- 
cessors just call the scheduler code themselves whenever they 
preempt (or terminate) a thread. The scheduler finds some 
thread that is ready to be executed by the processor. We 
added bubble management there: while looking for threads 
to execute, the scheduler code now also tries to "pull down" 
bubbles from high list levels and make them burst on a more 
local level. Getting an efficient implementation is complex, 
as explained below. 

Given a processor, two passes are done to look for the task 
(thread or bubble) with maximum priority among all the 
tasks of the lists "covering" that processor. The first pass 
quickly finds the list containing the task with the highest 
priority, without the need of a lock. That list and the list 
holding the currently running task are locked⁴. A second
pass is then used to check that the selected list still has a 
task of this priority, in case some other processor took it in 
the meantime. If the selected task is a thread, it is sched- 
uled; otherwise it is a bubble that the processor deals with 
appropriately (going down / bursting). The implementa- 
tion time-complexity is linear with respect to the number of 
hierarchical levels of the machine. 
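
The two-pass lookup can be sketched with C11 atomics as follows. This is a simplified model rather than Marcel's code: the `tasklist` structure is hypothetical, each list is summarized by the best priority it holds, a spinlock stands in for the real list lock, and the locking of the list holding the currently running task is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NO_TASK (-1)

struct tasklist {
    struct tasklist *parent;  /* more global list, NULL for the machine */
    atomic_flag lock;
    atomic_int max_prio;      /* highest queued priority, NO_TASK if empty */
};

static void list_lock(struct tasklist *l)
{
    while (atomic_flag_test_and_set(&l->lock))
        ;  /* spin */
}

static void list_unlock(struct tasklist *l)
{
    atomic_flag_clear(&l->lock);
}

/* Pass 1: lock-free scan of the covering lists, most local first. */
static struct tasklist *scan(struct tasklist *local, int *prio)
{
    struct tasklist *best = NULL;
    *prio = NO_TASK;
    for (struct tasklist *l = local; l != NULL; l = l->parent) {
        int p = atomic_load(&l->max_prio);
        if (p > *prio) {
            *prio = p;
            best = l;
        }
    }
    return best;
}

/* Returns the selected list, locked, or NULL if nothing is runnable.
 * Pass 2 re-checks under the lock and restarts if the task vanished. */
struct tasklist *select_list(struct tasklist *local)
{
    assert(local != NULL);
    for (;;) {
        int prio;
        struct tasklist *l = scan(local, &prio);
        if (l == NULL)
            return NULL;
        list_lock(l);
        if (atomic_load(&l->max_prio) == prio)
            return l;       /* still there: caller dequeues, then unlocks */
        list_unlock(l);     /* another processor took it: start over */
    }
}
```

Each scan visits one list per hierarchical level, which is where the linear complexity in the number of levels comes from.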

Regenerating a bubble is also a difficult operation. Putting
threads back into a given bubble requires removing all of them
from the task lists, except threads being executed. Those
threads go back in the bubble by themselves when the processors
executing them call the scheduler. Eventually, the
last thread closes the bubble and moves it up to the list
where it was initially released by the bubble holding it.

⁴ By convention, locking lists is done by locking high-level
lists first, and for a given level, according to the level elements'
identifiers.

Table 1: Cost of the modified Marcel scheduler for
searching lists, compared to other schedulers. Yield: list
search only, Switch: synchronization and context switch.

5. PERFORMANCE EVALUATION 

Our algorithm has some cost, but increases performance 
thanks to the resulting localization. 

5.1 Bubble scheduler cost 

We measured the performance impact of our implementation 
on the Marcel library running on a 2.66 GHz Pentium 
IV Xeon. Searching through lists has a reasonable cost, 
and our scheduler execution times are good compared to 
the Linux thread libraries LinuxThreads (2.4 kernel) and
NPTL (2.6 kernel); see Table 1.

Creation and destruction of a bubble holding a thread does 
not cost much more than creation and destruction of a simple
thread: the cost increases from 3.3 µs to 3.7 µs (about 12%).

Test-case examples of recursive creation of threads, such as
divide-and-conquer Fibonacci, show that the cost of systematically
adding bubbles that express the natural recursion
of thread creations is quickly balanced by the localization
that they bring: Figure 5 shows that performance is affected
when only a few threads are created, while on a HyperThreaded
dual Pentium IV Xeon, the performance gain
stabilizes at around 30 to 40% with 16 threads; on a NUMA
4x4 Itanium II, the gain is 40% with 32 threads and gets
up to 80% with 512 threads.

5.2 A real application 

Marc Pérache [23] used our scheduler in a comparison of
the efficiency of various scheduling strategies for heat conduction
and advection simulations. Results may be seen in
Table 2. The target machine is a ccNUMA Bull NovaScale
with 16 Itanium II processors and 64 GB of memory,
distributed among 4 NUMA nodes. For a given processor,
accessing the memory of its own node is about 3 times faster
than accessing the memory of another node. The applications
perform cycles of fully parallel computing followed by
a global hierarchical communication barrier.

Figure 5: Performance gain brought by adding bubbles
to the fibonacci test-case.

Table 2: Conduction performance depending on the approach.

In the simple version, the mesh is split into as many stripes
as the number of processors, and an opportunist schedule is
used. The bound version binds the threads to processors in a
non-portable way. This gets far better performance: each thread
remains on the same node, along with its data. Our proposal
lets the application query Marcel about the number of
NUMA nodes and processors and then automatically build
bubbles according to the hierarchy of the machine (hence
4 bubbles of 4 threads in this example). It gets performance
very similar to that of the bound version.
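
The way bubbles follow the machine hierarchy in this experiment can be illustrated by the arithmetic below. The `topology` structure and both functions are hypothetical stand-ins for whatever the application obtains from the library; they assume one thread per processor and processors evenly spread over the nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical topology summary, as an application might obtain it
 * by querying the thread library. */
struct topology {
    size_t nr_nodes;  /* NUMA nodes */
    size_t nr_cpus;   /* processors, evenly spread over the nodes */
};

/* Threads (one per processor) that each per-node bubble should hold. */
size_t threads_per_bubble(const struct topology *t)
{
    assert(t->nr_nodes > 0 && t->nr_cpus % t->nr_nodes == 0);
    return t->nr_cpus / t->nr_nodes;
}

/* Bubble (one per NUMA node) receiving the thread that computes `stripe`.
 * Neighbouring stripes end up in the same bubble. */
size_t bubble_of_stripe(const struct topology *t, size_t stripe)
{
    assert(stripe < t->nr_cpus);
    return stripe / threads_per_bubble(t);
}
```

On the 16-processor, 4-node machine used here, stripes 0 to 3 go to bubble 0, stripes 4 to 7 to bubble 1, and so on; the same code would give 2 bubbles of 4 threads on an 8-processor, 2-node machine.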

As can be seen, the use of bubbles attained performance 
close to that which may be achieved with a "handmade" 
thread distribution, but in a portable way. 

These applications are a simple example in which the workload
is balanced between stripes. The use of bubbles simply
allowed them to automatically fit the architecture of the
machine. However, in the future these applications will be 
modified to benefit from Adaptive Mesh Refinement (AMR) 
which increases computing precision on interesting areas. 
This will entail large workload imbalances in the mesh both 
at runtime and according to the computation results. It 
will hence be interesting to compare both development time 
and execution time of handmade-, opportunist-, and bubble- 
scheduled versions. 



6. CONCLUSION 

Multiprocessor machines are getting increasingly hierarchi- 
cal. This makes task scheduling extremely complex. More- 
over, the challenge is to get a scheduler that will perform 
"good" task scheduling on any multiprocessor machine with 
an arbitrary hierarchy, only guided by portable scheduling 
hints. 

In this paper, we presented a new mechanism making sig- 
nificant progress in that direction: the bubble model lets 
applications express affinity relations of varying degrees be- 
tween tasks in a portable way. The scheduler can then use 
these hints to distribute threads. 

Ideally, the scheduler would need no other information to 
perform this. But practically speaking, writing such a sched- 
uler is difficult and will need many experiments to be tuned. 
In the meantime, the programmer can use stricter guiding 
hints (indicating bubble bursting levels, for instance) so as 
to experiment with several strategies. 

Performance observations on several test-cases are promis- 
ing, far better than what opportunist schedulers can achieve, 
and close to what predetermined schedulers get. These ob- 
servations were obtained on several architectures (Intel PC 
SMP, Itanium II NUMA). 

This work opens numerous future prospects. In the short 
term, our proposal will be included within test-cases of real 
applications of CEA that run on highly hierarchical ma- 
chines, hence stressing the bubble mechanism power. It will 
then be useful to develop analysis tools based on tracing the 
scheduler at runtime, so as to check and refine scheduling 
strategies. It will also be useful to let the programmer set 
other attributes than just priorities, and thus influence the 
scheduler: "strength" of the bubble (which expresses the 
amount of affinity that the bubble represents), preemptibility,
some notion of amount of work, etc.

In the longer term, the goal is to provide a means of ex- 
pression powerful and portable enough for the application 
to obtain an automatic schedule that gets close to the "op- 
timal" whatever the underlying architecture. It could also 
be useful to provide more powerful memory allocation func- 
tions, specifying which scope of tasks (a bubble for instance) 
will use the allocated area. 